This workshop will provide an introduction to the programming language commonly referred to as R!
R is a popular programming language that many researchers use for organizing data, visualizing data, and carrying out statistical analyses.
By the end of this workshop series, our hope is that you will feel comfortable enough to work independently in R!
[What people think coding is versus what it actually is]
Are you ready to start learning R?
Before the workshop, we’ll need to download R and RStudio. Throughout the workshop, we’ll be working in RStudio, which will allow us to write code in R. So let’s make sure we have both R and RStudio installed before we begin!
Download R from a CRAN mirror, which simply hosts the R programming language that we will be using in RStudio. https://cran.r-project.org/
Download RStudio, which is the main software that we will be using to work with R. https://posit.co/download/rstudio-desktop/
Download the intro-to-coding-2025 folder from the TU-COG Github page (https://github.com/TU-Coding-Outreach-Group/intro-to-coding-2025) by pressing the green Code button and downloading the ZIP folder. This is the folder containing all the files we will be working with for the purposes of this workshop.
Open up a new R Markdown document by clicking File > New File > R Markdown. First time R users will be asked to download packages once they open up an R Markdown file. Click “Yes” to downloading those packages!
First things first: if you haven’t already done so, open RStudio. Then, you’re going to want to open a new R Markdown document by clicking File > New File > R Markdown…
This should produce a dialogue box where you can enter the name of the script and your name before selecting OK.
Next, let’s clear out all of the default text that appears in a new R Markdown document, which I have highlighted below:
In a typical coding script, every line must contain code that the language can interpret. If you want to include notes, you have to include a hash mark (#) before any code in order for the program to “ignore this line”, which can get a bit annoying. An R Markdown script is much more user friendly.
With R Markdown, any code that you would like R to interpret belongs in the coding chunk as illustrated below.
If we want to leave notes, we don’t have to “comment it out”. We can just write long-winded narration that can help others understand why we coded what we coded and what that code does.
That’s because a typical script will interpret any text as a command, unless the text is otherwise marked by a hash mark (#). An R Markdown script only interprets things as code when we tell it to, and we tell it what is code by creating a chunk. Chunks are marked by three backticks (```) followed by {r} and, on another line, three more backticks.
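For reference, here is what a minimal chunk looks like in the raw text of an R Markdown file (three backticks and {r} on the first line, and three closing backticks at the end):

````
```{r}
2 + 2
```
````

Everything between those markers is treated as R code; everything outside them is treated as plain narration.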
The output of any given chunk will appear just below that chunk, rather than in the RStudio Console Window. By output, we just mean the product, sum, or status of whatever calculation or item you are asking R to compute and show you.
R Markdown grants us greater control over what we see and when we see it. To demonstrate, let’s start by creating a new chunk in our markdown document and entering what we see in the image above. You can then follow along with the next bit:
2 + 2
## [1] 4
With a typical script, if we want to know the output of a line we ran a while ago, we either have to rerun it or scroll through the console to find it. With Markdown, we can minimize entire chunks and their output by using the minimization button on the left side of the window. If we want to hide output, we can use the expand/collapse button on the right side of the output window. We can choose exactly what we want to run using the “Run” command in the upper right corner of the chunk.
Also of note, the down-facing arrow (second icon in the upper right corner of the code block) tells R “Run all of the chunks that come before this one”. It can be helpful if you make a mistake and don’t want to manually rerun all of the previous chunks one by one to get back to where you were. It also makes your code very easy for other people to run. They can quite literally do it with the click of a button!
If we click the cog icon in the same tray, we can access the output options and manipulate where output appears and what it looks like, but that’s beyond the scope of this review.
“File paths” is just a fancy way to refer to the folders and subfolders that exist on someone’s computer. Understanding how to navigate file paths is really important for being able to access specific files that you want to work with!
Here’s an example of a file path: “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025”
Here’s a visualization of what that file path looks like on a computer:
Each part of the file path is a separate folder that we are traversing through.
We’ll revisit the concept of file paths when it’s time to set our working directory and find the data that we will be working with as part of the workshop.
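To see how R represents file paths, the base-R file.path() function glues folder names together with the correct separator. The folder names below are just an illustration:

```r
# file.path() joins folder names with the right separator
file.path("Users", "tuh20985", "Documents", "GitHub")
# prints "Users/tuh20985/Documents/GitHub"
```

Each argument is one level of the folder hierarchy, matching the visualization above.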
Packages in R are synonymous with libraries in other languages. They are, more or less, convenient shortcuts: functions that someone else has already programmed to save us some work. Somebody else already figured out a very quick way to compute a function, so now we don’t have to! We just use their tools to do it.
Every new package is centralized in R’s package repository, so even though thousands of people are working on these things independently, you don’t need to leave R to find them. Before packages can be used, they must be installed, and you can do that in a pretty simple way:
install.packages("PACKAGENAME")
If you need to update a package, you can just re-run the above code. If you’re using RStudio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ Tab of your Viewer Window.
Now that we have installed a package, that doesn’t mean we can use it yet. We need to tell R “We want access to the functions this package has during this session” by calling it with the library() command.
library(PACKAGENAME)
Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it lets R know we are planning on using that this session.
You might be wondering why we need to take this extra step. Sometimes different packages use the same command names, so having more than one of those active at the same time could confuse R (when this does happen, R will usually tell you). Loaded packages also take up memory, so having ALL of your packages initialized at once might leave your computer running extremely slowly. It’s the same for most languages.
If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.
Let’s try installing and loading a few packages for practice. Let’s install and load the following packages in R: tidyverse, report, lme4, and ggplot2.
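As a sketch, the install-then-load pattern looks like this. The install line is commented out so the example does not trigger a network download, and the parallel package is used for the library() demonstration only because it ships with R:

```r
# Run once per machine (commented out here to avoid a network download):
# install.packages(c("tidyverse", "report", "lme4", "ggplot2"))

# Run in every session where you need the package:
library(parallel)      # 'parallel' ships with R but is not loaded by default
exists("detectCores")  # TRUE: the package's functions are now available
```

The same two-step pattern applies to tidyverse, report, lme4, and ggplot2.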
For today’s workshop, we’ll be working with the palmerpenguins dataset. Please make sure you have the 2 CSV files (penguins.csv and penguins_raw.csv) already downloaded!
The palmerpenguins data contains two datasets:
The curated data, which contains size measurements for 344 penguins from three penguin species observed on three islands in the Palmer Archipelago, Antarctica. Although for the purposes of this workshop we’ve slightly modified the curated data such that there are data from 340 penguins :)
The raw data, which contains additional information about the penguins.
References: The palmerpenguins data were originally published in:
Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
In order to start working with the palmerpenguins datasets, we need to revisit the concept of “file paths” that we talked about earlier. We’ll use our understanding of file paths to set our “working directory”. At this point, some people may be wondering what a “working directory” is.
A working directory refers to the specific file path on your computer where R (or any programming language) will look for files by default. Like any other language or program, R needs to be told where the file that we’d like to work with is located on our computer. It doesn’t just know automatically.
Below we’ll use the getwd() command to check where my current working directory is.
getwd() #get your current working directory
## [1] "/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/R"
As you can see, my working directory is currently set at: “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/R”
The output probably doesn’t make much sense to anyone because this working directory file path is specific to my computer! Each of our working directory file paths will be specific to the computer we are using.
In order to work with the palmerpenguins datasets, we need to tell R where the files are located. We can create a new object that stores a specific file path to make this process easier. File paths will differ based on whether you are using a Windows versus a Mac.
If you’re using a Windows computer, it’s likely your file path will exist within your “C:” drive.
If you’re on a Mac, it’s likely your file path will start with a forward slash “/”. If you’re not sure of your path, R makes it relatively easy to find it.
# For Windows
Path <- "C:/"
# For Mac
Path <- "/"
On my computer, here’s where the palmerpenguins data exists:
In order to work with the penguins.csv file, I’ll need to set my working directory to this file path!
Let’s store this file path in an object called “Path”.
# For Windows
Path <- "C:/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/"
# For Mac
Path <- "/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/"
This format of assigning a value to an object, like we did with Path and “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/” is really important and we’ll keep coming back to it throughout this tutorial.
We’ll next use this Path object to set our working directory using the setwd() command. The setwd() command tells R where to look for our .csv file.
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
Amazing! Now that our working directory is set to the correct file path, we can start working with the data!
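A safe way to experiment with getwd() and setwd() yourself, using a temporary folder rather than a real project path, is:

```r
old_wd <- getwd()   # remember where we started
setwd(tempdir())    # point R at a temporary folder
getwd()             # now reports the temporary folder
setwd(old_wd)       # switch back so nothing else is affected
```

Saving the old working directory first means you can always return to it.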
Before we load in the palmerpenguins data, I want to highlight a little terminology. The data that R works with is always contained within what we call a ‘dataframe’. A dataframe represents the same thing that a spreadsheet represents in Excel. It contains many cells that are situated into columns (which have names) and rows (which may or may not have names).
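You can build a tiny dataframe by hand to see this structure; the values below are invented for illustration:

```r
# data.frame() builds named columns out of equal-length vectors
df <- data.frame(species        = c("Adelie", "Gentoo"),
                 bill_length_mm = c(39.1, 47.3))
nrow(df)   # 2: the number of rows
ncol(df)   # 2: the number of columns
names(df)  # the column names
```

Rows run across, columns run down, just like cells in an Excel spreadsheet.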
There are many ways to load data into R, and they all depend on the format the data is in. R can handle data from .csv, .xlsx, .txt, .html, .json, SPSS, Stata, and SAS files, among others. R also has its own data formats (.RDA, .RData). With the exception of .RDA, .csv is often the cleanest means of reading in data. We won’t cover the other formats here, but they are fairly exhaustively covered elsewhere.
Before reading in the palmerpenguins CSV file, we need to use the setwd() function to tell R where to look for our CSV file.
Let’s use the Path object that we created earlier to set our working directory to where the penguins.csv file is located on our computer.
In the most basic sense, we can load our penguins.csv data file using the read.csv() function like this:
penguins <- read.csv(file = "penguins.csv") #Load in the penguins CSV file and store it in a data frame called "penguins"
The read.csv() command actually loads in the data. If done correctly, we should see our Environment populate with a dataframe labeled penguins.
Since we’re all using the same dataset, the number of observations and variables should be the same as in the picture above. Here, you can think of observations as “rows” and variables as “columns”. If you click on penguins in the environment, it will open in a new tab of your Source Window (The same window you are likely writing script in) where you can view it.
We can also view the penguins data frame by using the View() command.
If we want to look at the first few rows of the penguins data frame, we can use the head() command.
head(penguins) #will show you a subset of rows within the Data Frame
View(penguins) #will open up the full data frame like you would in Excel
Amazing! Now that we can see the penguins dataset, let’s get a better idea of what each column represents:
sample_number column – Represents the continuous numbering sequence for each sample where data was collected
species column – The three types of penguin species
island column – The three islands where the penguins were observed
bill_length_mm column – A continuous number denoting bill length in millimeters
bill_depth_mm column – An integer denoting bill depth in millimeters
flipper_length_mm column – An integer denoting flipper length in millimeters
body_mass_g column – A continuous number denoting body mass in grams
sex column – Sex of the penguin
year column – Year when the study took place
It’s super important to recognize the different data types that exist in R. Some of the common data types include: character, factor, double/numeric, integer.
You may have noticed that for some variables, like bill_length_mm, we wrote “continuous number”, whereas for other variables, like flipper_length_mm, we wrote “integer”. This is because R uses different data types, which we’ll talk about more below.
Let’s use the glimpse() function from the tidyverse library to print out the data type for each column in the penguins data frame.
glimpse(penguins)
Put simply:
dbl reflects double/numeric, which covers any continuous number and can include decimals.
int reflects integer, which only includes whole numbers and does not include decimals.
chr reflects character, which includes string-based text.
factor reflects factor, which includes text or numbers but is typically used to reflect a category (think “Low”, “Medium”, “High” for text, or “1”, “2”, “3” for rankings).
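A quick way to see these types in action is to create one value of each and ask R what it is; the class() function reports a value's type:

```r
x_dbl <- 3.14                      # a double: continuous, allows decimals
x_int <- 5L                        # the L suffix makes an integer
x_chr <- "Adelie"                  # a character string
x_fct <- factor(c("Low", "High"))  # a factor: categorical data
class(x_dbl)  # "numeric"
class(x_int)  # "integer"
class(x_chr)  # "character"
class(x_fct)  # "factor"
```

The related typeof() function gives R's lower-level storage type (e.g. "double" for x_dbl).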
Now that we’ve got a better understanding of what data can look like in R, let’s start working with the data.
By looking at the penguins data frame, we can see that we aren’t working with a perfectly clean dataset: some of the rows have missing data! And we may not need all of the columns in the dataframe.
So how do we work with specific rows? How do we work with specific columns? And how can we check what data is missing? Learning how to access specific elements of a data frame is an extremely important part of learning R.
dataframe$column will print out all the rows in that column. Let’s print out all the penguin species IDs that exist in the data frame.
penguins$species
What if we want to see a specific row? Let’s say row 2 within the species column? To reference a specific row in a given column, we can add brackets and the number of that row in the brackets:
The code below will print out the second row in the species column. We can manually confirm this by opening up the penguins data frame and looking at the second row in the species column.
penguins$species[2]
However, we can also index the column using its relative position. Knowing that the species column is the first column, I can use bracket notation. Bracket notation is super helpful once you understand its structure. It helps me to think of it as [rows, columns]. Any number that appears before the comma will access rows, and any number that appears after the comma will access columns.
By including the name of the data frame before the bracket notation, we can pull certain rows and columns from that data frame
penguins[1,] # print the first row across all columns
penguins[,2] # print all the rows for column 2
penguins[1,2] # print the first row in column 2
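The same indexing works on any data frame. As a sketch, here is a tiny stand-in data frame (with invented values) you can experiment with even without the penguins file:

```r
toy <- data.frame(species        = c("Adelie", "Gentoo", "Chinstrap"),
                  bill_length_mm = c(39.1, 47.3, 48.5))
toy$species     # all rows of the species column
toy$species[2]  # "Gentoo": row 2 of that column
toy[1, ]        # first row, all columns
toy[, 2]        # all rows of the second column
toy[1, 2]       # 39.1: row 1, column 2
```

Reading brackets as [rows, columns] makes each line above predictable.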
Now that we know how to access rows and columns, let’s move onto subsetting!
Subsetting is a technique for filtering rows or columns in a given data frame in order to remove any rows/columns you are not interested in using.
We can subset rows and columns, but let’s start off with focusing on how we can subset specific rows.
We know that there are three species of penguins in this dataset: Adelie, Gentoo, and Chinstrap.
Let’s say we only wanted to focus on penguins from the Adelie species.
Thankfully, the filter() function from the tidyverse library makes subsetting a bit easier. Here, we will use the filter() function to subset the rows that represent penguins from the Adelie species and store these rows in a new data frame called “penguins_adelie”.
penguins_adelie <- penguins %>%
filter(species == "Adelie")
If you open up the penguins_adelie dataframe using View(penguins_adelie) or manually clicking on the new data object in the R environment window, you should see that all the rows in the species column of this data frame reflect the Adelie species.
Let’s break down 2 operators here that are particularly important.
%>% refers to piping, which lets you pass the result of one function directly as the input to the next function. Here, all we are doing is feeding in the penguins data frame into the filter() function.
Notice the two equals signs (==) in this part of the code: filter(species == “Adelie”)
When these operators are doubled or combined with an equals sign (==, >=, <=, !=), R, like many other languages, is not assigning a value to an object; it is comparing two values. In this instance, using two equals signs, if the two values are equal, the comparison produces a TRUE value; if not, a FALSE. A variable that can only take the value TRUE or FALSE is called a boolean (R calls this type “logical”). When we tell R to compare the value on the right with this specific column, what it is mechanically doing is iterating through each row within this column, comparing the column value, and determining whether the condition is TRUE or FALSE.
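You can see this comparison behavior directly in the console; note that comparing a vector to a single value works element by element:

```r
2 == 2                             # TRUE: the values match
2 == 3                             # FALSE: the values differ
c("Adelie", "Gentoo") == "Adelie"  # TRUE FALSE: compared row by row
```

That element-by-element result is exactly what filter() uses to decide which rows to keep.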
We can use this same strategy, piping (%>%) and value operators (==), to subset columns as well.
Let’s say we were only interested in examining bill length, and not as interested in examining bill depth, flipper length, body mass, sex, and year.
We can use the select() function from the tidyverse library to keep the sample_number, species, island, and bill_length_mm columns, while also removing the columns we are not interested in (bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year).
Here, we will use the select() function to keep the columns of interest (sample_number, species, island, bill_length_mm) in a new data frame called “penguins_bill_length”.
penguins_bill_length <- penguins %>%
select(sample_number, species, island, bill_length_mm)
If you open up the penguins_bill_length dataframe using View(penguins_bill_length), or by manually clicking on penguins_bill_length in the R environment window, you should see only the columns we wanted to keep (sample_number, species, island, bill_length_mm).
What if, rather than subsetting based on one condition (i.e., rows that represent penguins from the Adelie species), we wanted to subset based on multiple conditions?
We can take advantage of the OR ( | ) operator using the filter() function.
Let’s say we were only interested in examining Adelie or Gentoo penguins, but not Chinstrap penguins.
We can use the OR operator to tell R to subset all rows where species is equal to “Adelie” OR “Gentoo” and store it in a new data frame called “penguins_adelie_gentoo”.
penguins_adelie_gentoo <- penguins %>%
filter(species == "Adelie" | species == "Gentoo")
If you open up the penguins_adelie_gentoo data frame, you can see that all the rows in the species column represent Adelie or Gentoo penguins, but not Chinstrap penguins
We can also take advantage of the AND ( & ) operator using the filter() function.
Let’s say we were only interested in examining penguins from the Adelie species and that lived on Biscoe island.
We can use the AND operator to tell R to subset all rows where species is equal to “Adelie” AND island is equal to “Biscoe” and store it in a new data frame called “penguins_adelie_biscoe”.
penguins_adelie_biscoe <- penguins %>%
filter(species == "Adelie" & island == "Biscoe")
If you open up the penguins_adelie_biscoe data frame, you can see that all the rows in the species column represent Adelie and all the rows in the island column represent Biscoe.
As you can see, leveraging the OR ( | ) and AND ( & ) operators in conjunction with the filter() function can be especially powerful.
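To see both operators side by side without the workshop files, here is a sketch on a small invented data frame (it assumes the dplyr package from the tidyverse is installed):

```r
library(dplyr)  # for filter() and %>%
toy <- data.frame(species = c("Adelie", "Adelie", "Gentoo"),
                  island  = c("Biscoe", "Dream",  "Biscoe"))
nrow(toy %>% filter(species == "Adelie" | island == "Biscoe"))  # 3: either condition
nrow(toy %>% filter(species == "Adelie" & island == "Biscoe"))  # 1: both conditions
```

OR keeps a row when at least one condition is TRUE; AND keeps a row only when every condition is TRUE.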
Let’s try an exercise where we have to subset data!
1) Create a new data frame called “penguin_chinstrap” and subset rows that represent penguins from the Chinstrap species.
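One possible solution, sketched here with a small stand-in data frame (in the workshop you would start from the real penguins data; this assumes dplyr is installed):

```r
library(dplyr)
# stand-in for the penguins data frame
penguins_toy <- data.frame(species = c("Adelie", "Chinstrap", "Gentoo", "Chinstrap"))
penguin_chinstrap <- penguins_toy %>%
  filter(species == "Chinstrap")
nrow(penguin_chinstrap)  # 2: only the Chinstrap rows remain
```

The pattern is identical to the Adelie example above, just with a different species name.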
What if we wanted to see which rows had missing values (e.g., NA) or not? In other words, what if some penguins were missing information about their bill length?
Let’s go back to our original penguins data frame.
We can use the is.na() function to determine which rows have missing values in the bill_length_mm column.
is.na(penguins$bill_length_mm)
This will produce an array of TRUEs and FALSEs of the same length as the rows in the dataframe, because each TRUE and FALSE is telling us whether each row in that column meets the condition we defined.
For example, we see that the first output is FALSE, which means that the first row in the bill_length_mm column does NOT have a missing value. We can confirm this by opening up the penguins data frame and going to the first row in the bill_length_mm column and seeing that it is not empty.
However, we see that the 4th output is TRUE, which means that the fourth row in the bill_length_mm column DOES have a missing value. We can confirm this by opening up the penguins data frame and going to the fourth row in the bill_length_mm column and seeing that it is indeed missing.
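The same behavior is easy to see on a small vector with one missing value:

```r
bill <- c(39.1, NA, 40.2)
is.na(bill)       # FALSE TRUE FALSE: only the second value is missing
sum(is.na(bill))  # 1: TRUE counts as 1, so this totals the missing values
```

Summing the TRUEs is a handy trick for counting how much data is missing in a column.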
But how can we remove missing data (i.e., rows that are blank or have an ‘NA’ in it) from a data frame?
We can use the drop_na() function from the tidyverse library to create a new data frame called “penguins_complete” that only includes rows with no missing values.
penguins_complete <- penguins %>% drop_na()
What if, instead of removing rows that have a missing value in ANY column, we wanted to remove any rows that have a missing value in ONE column? We can again use the drop_na() function and specify the specific column that we want to remove rows with missing values from.
Let’s create a new data framed called “penguins_bill_length_complete” that only includes rows with no missing values in the bill_length_mm column.
penguins_bill_length_complete <- penguins %>% drop_na(bill_length_mm)
As you can see, the penguins_bill_length_complete data frame only includes rows that do not have a missing value in the bill_length_mm column.
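Here are both uses of drop_na() on a tiny invented data frame (this assumes the tidyr package is installed):

```r
library(tidyr)  # for drop_na()
toy <- data.frame(species        = c("Adelie", "Gentoo", "Chinstrap"),
                  bill_length_mm = c(39.1, NA, 48.5),
                  sex            = c("male", "female", NA))
nrow(drop_na(toy))                  # 1: only the first row has no NAs at all
nrow(drop_na(toy, bill_length_mm))  # 2: drops only the row missing bill length
```

Naming a column restricts the NA check to that column, so fewer rows are lost.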
An if-else statement is a powerful control flow tool that allows a computer to make decisions and execute different code paths based on specific conditions.
It works kind of like a fork in the road: if a certain condition is true, the program follows one path of code; if the condition is false, it follows an alternative code path.
It can be helpful to think of it as a digital version of decision-making: “If it’s raining, I’ll take an umbrella; else, I’ll leave without one.”
Here’s an example of what a general if-else expression looks like in R. This is similar to the fork analogy that we described earlier.
if (condition) {
expression
} else {
expression
}
We can map this structure onto R’s ifelse() function, which makes writing if-else statements a bit easier.
Let’s use an if-else statement to create a new column that represents whether an island existed in the Eastern or Western hemisphere of the world.
Let’s say that the Biscoe island existed in the Eastern hemisphere, and the Dream and Torgersen islands existed in the Western hemisphere.
The structure for ifelse() statements is as follows: if the value in the island column has a cell that is equal to “Biscoe”, R will insert a value of “Eastern” in the new hemisphere column for that cell, else, insert a value of “Western”.
We can use the mutate() function from the tidyverse to add the new “hemisphere” column to our penguins data frame.
# Use mutate() with ifelse() to add a new column called "hemisphere" that reflects whether the island was located on the Eastern or Western hemisphere
penguins <- penguins %>%
mutate(hemisphere = ifelse(penguins$island == "Biscoe", "Eastern", "Western"))
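Because ifelse() is vectorized, it works across a whole column at once; here it is on a small invented vector:

```r
island <- c("Biscoe", "Dream", "Torgersen")
# one TRUE/FALSE per element, so one "Eastern"/"Western" per element
ifelse(island == "Biscoe", "Eastern", "Western")
# "Eastern" "Western" "Western"
```

This is the same logic mutate() applies to every row of the island column.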
Here’s what our code would look like within the general if-else expression that we showed above. (Note that a plain if() in R only checks a single value, so this version is just an illustration of the logic; the vectorized ifelse() is what actually works across a whole column.)
#General structure of if statement
if (penguins$island == "Biscoe") {
penguins$hemisphere <- "Eastern"
} else {
penguins$hemisphere <- "Western"
}
We can also use a for-loop to accomplish the same outcome that we had when using the mutate() function with ifelse() function.
A for-loop is one of the main control-flow constructs of the R programming language. It is used to iterate over a collection of objects, such as a vector, a list, a matrix, or a dataframe, and apply the same set of operations on each item of a given data structure.
for (i in 1:length(penguins$island)) {
if (penguins$island[i] == "Biscoe") {
penguins$hemisphere[i] <- "Eastern"
} else {
penguins$hemisphere[i] <- "Western"
}
}
Let’s break this for-loop code down in some more detail.
1) for (i in 1:length(penguins$island)) { — “i” is a temporary variable that stores the value of the current position in the range of the for loop. In this case, we are telling R that we want “i” to represent each row within the length of the island column, starting at row 1 and going all the way down to the last row in the island column. “i” will iterate across each of these rows.
2) if (penguins$island[i] == “Biscoe”) { — This if statement is saying: if the value at row i of the island column is equal to “Biscoe”. Think about it as “if the first cell in the island column is equal to ‘Biscoe’, then if the second cell in the island column is equal to ‘Biscoe’, then if the third cell in the island column is equal to ‘Biscoe’”, and so on.
3) penguins$hemisphere[i] <- “Eastern” — input a value of “Eastern” into the corresponding cell of the new “hemisphere” column.
4) } else { — the initiation of the else condition
5) penguins$hemisphere[i] <- “Western” — input a value of “Western” into the corresponding cell of the new “hemisphere” column.
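You can run the same pattern on a small invented vector to watch it work step by step:

```r
island <- c("Biscoe", "Dream", "Biscoe")
hemisphere <- character(length(island))  # empty vector to fill in
for (i in 1:length(island)) {
  if (island[i] == "Biscoe") {
    hemisphere[i] <- "Eastern"
  } else {
    hemisphere[i] <- "Western"
  }
}
hemisphere  # "Eastern" "Western" "Eastern"
```

Each pass through the loop checks one element of island and fills in the matching element of hemisphere.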
Merging data frames allows us to combine different datasets based on common variables. In other words, merging data together allows us to combine information from multiple sources into a single, comprehensive dataset.
To illustrate why merging is especially powerful, let’s start by reading in the penguins_raw CSV file, which contains additional information about the penguins.
penguins_raw <- read.csv(file = "penguins_raw.csv") #Load in the penguins_raw CSV file and store it in a data frame called "penguins_raw"
In the original penguins dataset that we’ve been working with, we can examine bill length (bill_length_mm) and flipper length (flipper_length_mm), but let’s say we were really interested in examining culmen (the ridge along the top part of a penguin’s bill) length. As we can see, the original penguins dataset does not have this information, but the penguins raw dataset does!
We can combine these two data frames into one data frame, so that we have all the information in one place, and so that we can examine culmen length.
Let’s use the merge() function to merge the original penguins data frame and the penguins raw data frame into a new data frame called “penguins_merged”.
Here’s the general structure of the merge() function.
1) x and y: The two data frames you want to merge
2) by: The column(s) to merge on when the column name is the same in both data frames
penguins_merged <- merge(penguins, penguins_raw, by=c("sample_number", "species", "island", "flipper_length_mm", "body_mass_g", "sex"))
In our code, x and y represent the data frames we want to merge. All of the columns listed in the by=c() expression are the columns that are shared across both data frames (i.e., sample_number, species, island, flipper_length_mm, body_mass_g, sex), so that R knows which columns we want to match across.
In this way, R can then carry over the columns that are NOT shared across the two data frames (i.e., bill_length_mm, bill_depth_mm, year, stage, clutch_completion, date_egg, culmen_length_mm, culmen_depth_mm, delta_15_n, delta_13_c, comments).
Now we have all the information we need in one place!
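The mechanics of merge() are easier to see on two tiny invented data frames that share a single ID column:

```r
left  <- data.frame(sample_number = 1:3,
                    species       = c("Adelie", "Gentoo", "Chinstrap"))
right <- data.frame(sample_number    = 1:3,
                    culmen_length_mm = c(38.9, 46.8, 49.0))
merged <- merge(left, right, by = "sample_number")
names(merged)  # "sample_number" "species" "culmen_length_mm"
nrow(merged)   # 3: rows were matched up by sample_number
```

Rows whose sample_number appears in both data frames are lined up, and the non-shared columns from each side are carried over.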
Next we will learn how to pivot data from wide to long, and long to wide.
Before we get started with pivoting, let’s subset a smaller data frame just to make it easier to work with.
Let’s say we only wanted to focus on the length of different anatomical regions (i.e., bill, flipper, culmen). We can subset a data frame with the following columns: sample_number, species, island, bill_length_mm, flipper_length_mm, and culmen_length_mm and store these columns in a new data frame called “penguins_length”.
penguins_length <- penguins_merged %>%
select(sample_number, species, island, bill_length_mm, flipper_length_mm, culmen_length_mm)
Perfect! Next let’s break down what it means to pivot data.
When data exists in a “wide” format, that means that each individual only has 1 row.
When data exists in a “long” format, that means that each individual has more than 1 row.
For the purposes of understanding how to pivot data frames, we will be using the penguins_length data frame that we just created.
In order to do many common statistical analyses, data needs to be in long format. For example, let’s say we wanted to show how bill, flipper, and culmen length differ across penguin species.
If we wanted to make a graph to visualize this comparison (bill length vs flipper length vs culmen length), we could use the species column as our x-variable, but what would our y-variable be? The lengths are spread across 3 columns (bill_length_mm, flipper_length_mm, and culmen_length_mm), yet a graph only has 1 y-variable. So, as the data currently stands, we could not create a graph of these comparisons! We must pivot.
The pivot_longer() function from the tidyr package in R can be used to pivot a data frame from a wide format to a long format.
penguins_long <- penguins_length %>% pivot_longer(
cols=c("bill_length_mm", "flipper_length_mm", "culmen_length_mm"), #The names of the columns to pivot
names_to = "measure", #The name for the new character column
values_to = "length") #The name for the new values column
Here, we did a basic pivot from wide to long. The new measure column holds the names of the columns that we listed in the cols argument of pivot_longer(), and the new length column holds the values that those columns contained.
This data frame is in long format because each penguin now has 3 rows (1 for bill length, 1 for flipper length, and 1 for culmen length).
Pivoting data can be a bit tricky to understand at first, but hopefully this example helps!
Data can also be converted from long to wide!
The pivot_wider() function from the tidyr package in R can be used to pivot a data frame from long format to wide format.
Let’s work on pivoting the long data frame that we created back to its original wide format!
For the purposes of this example, each penguin needs a unique “ID” so that pivot_wider() knows which rows belong to the same penguin. We can very easily assign an “ID” by adding a new column that reflects the number of each row.
#Pivot wider
penguins_wide <- penguins_long %>%
pivot_wider(names_from = measure, #names_from: The column whose values will be used as column names
values_from = length) #values_from: The column whose values will be used as cell values
Here, we converted the data frame from long back to its original wide format! We know it’s in wide format because each penguin now only has 1 row.
By passing the “measure” column to the names_from argument, we are telling R that this is the column whose values will be used to generate column names. By passing the “length” column to the values_from argument, we are telling R that this is the column whose values will be used to fill in the cells.
1) Subset the following columns from the penguin_length data frame and store them in a new data frame called “penguin_pivot_exercise”: sample_number, species, bill_length_mm, bill_depth_mm
2) Pivot the data frame from wide to long and store it in a new data frame called “penguin_pivot_exercise_long”
3) You should end up with 4 columns: sample_number, species, bill_measure, length
Next, we are going to start talking about different types of statistical analyses in R!
We’re going to focus on three common statistical analyses: t-tests, ANOVAs, and linear regressions.
We’ll continue our focus on comparing the length of different anatomical parts (bill, flipper, and culmen) across penguin species in our analyses below!
For our analyses, let’s use the penguins_length data frame that we created earlier which includes the following columns: sample_number, species, island, bill_length_mm, flipper_length_mm, culmen_length_mm.
To make things easier for our analyses, let’s remove any rows with missing values in the data.
penguins_length <- penguins_length %>% drop_na()
A T-Test can be used when the predictor variable consists of two categorical options and the outcome or dependent variable is numeric in value. A T-Test tells you how significant the difference between these categories or groups is. In other words, it lets you know whether the difference between the means of two groups could have been observed by chance. We could imagine a situation where an evil teacher told half of the class the right chapter to study from before a test and told the other half of the class the wrong chapter to study from. The two categories or groups might be Right Chapter and Wrong Chapter, and the outcome variable would be Test Score. Using a T-Test, we could determine whether studying from the right chapter produces higher test scores.
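To see this in action before we touch the penguins data, here is a minimal sketch using simulated test scores for the evil-teacher example (the group means and sample sizes are made up for illustration):

```r
set.seed(42) # make the simulated data reproducible

# Hypothetical test scores: 30 students per group
right_chapter <- rnorm(30, mean = 85, sd = 5) # told the right chapter
wrong_chapter <- rnorm(30, mean = 70, sd = 5) # told the wrong chapter

# Unpaired two-samples t-test, in the same form we'll use for the penguins
teacher_test <- t.test(x = right_chapter, y = wrong_chapter,
                       paired = FALSE, alternative = "two.sided")
teacher_test$p.value # tiny p-value: the group means genuinely differ
```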
Below, we propose a research question and t-test analysis for our penguins dataset.
QUESTION: Do Adelie and Gentoo penguins differ in flipper length?
HYPOTHESIS: “On average, Adelie and Gentoo penguins differ in flipper length.”
RELEVANT VARIABLES: Dependent: flipper_length_mm (numeric) Independent: species (Factor)
ANALYSIS: Unpaired Two-Samples T-Test
Now let’s do the t-test!
We’ll need to use conditional statements again to specify our variables. What we are comparing here are the mean values of flipper length for Adelie and Gentoo penguins. As such, we are going to specify that we want flipper length when species == “Adelie” and when species == “Gentoo”. We next have an argument that asks whether this analysis is within-subjects or between-subjects. This comparison is between-subjects, since each penguin is either from the Adelie species or the Gentoo species, so we’ll set paired = FALSE. Lastly, R asks us to define our alternative hypothesis, which is a little beyond the scope of this review, so you will have to take my word that “two.sided” is the right call.
model1 <- t.test(x = penguins_length$flipper_length_mm[penguins_length$species == "Adelie"],
y = penguins_length$flipper_length_mm[penguins_length$species == "Gentoo"],
paired = FALSE,
alternative = "two.sided")
#Print the model results
model1
We can observe the results of our t-test by printing the model1 object that we created. However, the report package makes interpreting results a little easier!
Let’s use the report() function from the report library to help with our interpretation of the t-test results.
report(model1)
So it looks like our hypothesis panned out! We see statistically significant differences, judging by the model output (t(261.76) = -34.44, p-value < .001). Since penguins from the Adelie species were our reference group, the negative value of the T-statistic indicates that the first group (Adelie) has a lower mean than the second group (Gentoo). In other words, on average, penguins from the Gentoo species have a longer flipper length than penguins from the Adelie species.
Let’s switch to ANOVAs!
An ANOVA, or Analysis of Variance, can be used when the predictor variable or variables consist of two or more categorical options and the outcome or dependent variable is numeric in value. Much like a T-Test, an ANOVA tells you how significant the differences between these categories or groups are. The advantage over T-Tests is that we can compare multiple groups or categories in one analysis. We could revisit our last horrible example and imagine that the evil teacher tells one group the right chapter to study from, one group the wrong chapter, and one group not to study at all. An ANOVA will tell us whether any of these three groups differ from one another (but not necessarily which specific groups differ).
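A minimal sketch of that three-group scenario, again with simulated scores (all numbers made up for illustration):

```r
set.seed(1) # make the simulated data reproducible

# Hypothetical test scores for three groups of 20 students each
scores <- data.frame(
  group = rep(c("right_chapter", "wrong_chapter", "no_study"), each = 20),
  score = c(rnorm(20, 85, 5), rnorm(20, 70, 5), rnorm(20, 60, 5))
)

# One-way ANOVA: does ANY group mean differ from the others?
teacher_aov <- aov(score ~ group, data = scores)
summary(teacher_aov) # a single F-test covering all three groups at once
```

To find out which specific groups differ, you would follow up with a post-hoc test such as TukeyHSD(teacher_aov).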
Using the penguins_length data frame, we’ll run a similar analysis to the one we did for the t-test, but this time we’ll include all three penguin species (Adelie, Gentoo, and Chinstrap).
QUESTION: Does bill length differ across penguin species (i.e., Adelie, Gentoo, Chinstrap)?
HYPOTHESIS: Differences will exist in bill length by penguin species.
RELEVANT VARIABLES: Dependent: bill_length_mm (numeric) Independent: species (Factor)
ANALYSIS: ANOVA
Pay close attention to the formatting of the syntax here. It is the standard way in which we specify most statistical models in R, whether for regression, ANOVA, hierarchical modeling etc.
model2 <- aov(bill_length_mm ~ species, data = penguins_length) #create ANOVA model and store in an object called model2
#print summary of model results
summary(model2)
The report() function works equally well on ANOVA objects.
report(model2)
It seems that species has a significant main effect on bill length, judging by the model output (F(2, 339) = 410.60, p < .001).
Next, we’ll switch to linear regressions!
A bivariate linear regression can be used when both the predictor variable (X) and outcome variable (Y) consist of continuous numeric values. A linear regression tells us how well X predicts Y. For example, if we measured temperature and ice cream sales, we might find using linear regression that as temperature increases, ice cream sales increase as well, and we could predict how many ice cream sales to expect for any one value of temperature.
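A quick sketch of that ice cream example with simulated data (the numbers are invented for illustration):

```r
set.seed(7) # make the simulated data reproducible

# Hypothetical data: 50 days of temperatures (Celsius) and ice cream sales
temperature <- runif(50, min = 15, max = 35)
sales <- 20 + 3 * temperature + rnorm(50, sd = 5) # true slope is 3

# Fit the bivariate regression: sales predicted by temperature
icecream_fit <- lm(sales ~ temperature)
coef(icecream_fit)["temperature"] # estimated slope, close to the true value of 3

# Predict the expected sales for one specific temperature (a 30-degree day)
predict(icecream_fit, newdata = data.frame(temperature = 30))
```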
Within the context of the penguins dataset, we could ask a question like whether the length of penguins’ bills predicts the length of their flippers.
QUESTION: Does bill length predict flipper length?
HYPOTHESIS: As bill length increases, flipper length will also increase.
RELEVANT VARIABLES: Dependent: Flipper length (numeric) Independent: Bill length (numeric)
ANALYSIS: Bivariate Linear Regression
Okay now it’s time to run our regression!
You’re going to notice right off the bat that the structure of the syntax here looks awfully similar to what we just did in ANOVA. We start by noting our method with the lm() function (Linear Modeling). We then note our outcome variable, add a tilde (~), note our predictor(s), and finally note our datasource.
#Store bivariate linear regression in an object called "m1"
m1 <- lm(flipper_length_mm ~ bill_length_mm, data = penguins_length)
Just like with ANOVA, we need to use the summary() function to view the results.
#Use summary() function to print summary for m1 bivariate linear model
summary(m1)
Lastly, we can run the report() function for linear regression models too!
#Use report() function to print report for m1 bivariate linear model
report(m1)
Linear regression results show that bill length statistically predicts flipper length, as observed by model output stats: (beta = 1.67, 95% CI [1.46, 1.88], t(331) = 15.69, p < .001).
Whereas bivariate linear regression uses 1 independent predictor and 1 dependent variable (hence “bi”), a multivariate linear regression builds upon bivariate regression by allowing for more than 1 predictor (hence “multi”). If we measured temperature, ice cream sales, and the time since someone last ate, we might find that our previously specified model becomes even more accurate with the addition of the “time since last ate” variable.
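Extending the simulated ice cream example, a second predictor can sharpen the model (again, all numbers are made up for illustration):

```r
set.seed(7) # make the simulated data reproducible

# Hypothetical data: temperature, hours since last meal, and ice cream sales
temperature <- runif(50, min = 15, max = 35)
hours_since_ate <- runif(50, min = 0, max = 8)
sales <- 20 + 3 * temperature + 2 * hours_since_ate + rnorm(50, sd = 5)

# Bivariate model (temperature only) vs multivariate model (both predictors)
fit_bi    <- lm(sales ~ temperature)
fit_multi <- lm(sales ~ temperature + hours_since_ate)

# Adding the second predictor explains more of the variance in sales
summary(fit_bi)$r.squared
summary(fit_multi)$r.squared # higher than the bivariate model's
```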
Using the penguins data, let’s examine if bill length still predicts flipper length when controlling for differences in sex.
QUESTION: Does bill length predict flipper length when accounting for the effect of sex?
HYPOTHESIS: Bill length still predicts flipper length when controlling for sex.
RELEVANT VARIABLES: Dependent: flipper length (numeric) Independent: Bill length (numeric) Independent: sex (factor)
ANALYSIS: Multivariate Linear Regression
#Store multivariate linear regression in an object called "m2"
m2 <- lm(flipper_length_mm ~ bill_length_mm + sex, data = penguins_bill_flipper)
Just like with the bivariate linear regression, we need to use the summary() function to view the results.
#Use summary() function to print summary for m2 multivariate linear model
summary(m2)
Lastly, we can run the report() function for linear models as well!
#Use report() function to print report for m2 multivariate linear model
report(m2)
Linear regression results show that bill length still statistically predicts flipper length when controlling for differences in sex, as observed by model output stats: (beta = 1.64, 95% CI [1.42, 1.87], t(330) = 14.46, p < .001).
Hopefully these analyses have made sense! Our hope is that you can use this analysis code to conduct your own analyses in your own research!
Next we’ll start talking about how to make graphs in R!
ggplot2 is an extremely popular plotting package in R that makes it fairly easy to create complex plots from data in a data frame.
ggplot2 refers to the name of the package itself, whereas we use the function ggplot() to generate the plots. We’re going to start off with building a very simple plot, and then we will add in some more lines to organize a plot like you would for a manuscript/publication!
We can start off by generating a histogram to visualize our dataset and see the distribution of flipper length by penguin species.
ggplot(penguins_length, aes(x=flipper_length_mm, fill=species)) + geom_histogram()
Histograms are a very helpful way to visualize data and better understand the distribution of specific variables of interest, and can also be helpful for visually identifying outliers.
Next, we’ll move onto creating bar plots and adding standard error bars to graphs!
Let’s plot the ANOVA we ran earlier, where we examined whether bill length differed across the three penguin species.
In order to create a bar plot with standard error bars, we have to first run the summarySE() function that will allow us to calculate the standard error for our dependent variable (bill length).
As a quick reminder, standard error bars reflect the uncertainty or variability in the mean value that’s being displayed. In other words, they indicate how precisely we know the true population mean - a smaller standard error suggests we can be more confident in our estimate, while larger error bars indicate more uncertainty. The standard error is calculated by dividing the standard deviation by the square root of the sample size.
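That formula is easy to check by hand. A tiny sketch with five made-up flipper lengths (not the real penguins data):

```r
# Five hypothetical flipper lengths (mm), for illustration only
x <- c(180, 185, 190, 195, 200)

# Standard error = standard deviation / square root of the sample size
se <- sd(x) / sqrt(length(x))
se # about 3.54 mm
```

This is exactly the calculation that the summarySE() function below performs for each group.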
The summarySE function is appropriate when working with between-subjects variables. If you have within-subjects variables and want to adjust the error bars so that inter-subject variability is removed as suggested in Loftus and Masson (1994), then the other two functions, normDataWithin and summarySEwithin must also be added to your code; summarySEwithin will then be the function that you call.
You don’t have to change anything in this code – just run it as it is! All this code is doing is authorizing the use of the summarySE() function.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summarized
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
library(plyr)
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
Now that we have the summarySE() function loaded in, we can calculate the standard error of bill length for each penguin species.
Below, we will use the summarySE() function where the measurevar argument reflects our dependent variable (bill_length_mm) and the groupvars argument reflects our independent variable (species).
# summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
penguins_length.se <- summarySE(penguins_length, measurevar="bill_length_mm", groupvars=c("species"))
#We can print the new data frame that we created to see what information is stored there
print(penguins_length.se)
Great! We can see that the penguins_length.se data frame reflects the mean, standard deviation, standard error, and confidence intervals for our dependent variable of interest (bill length) across the three penguin species.
Now that we have the standard errors calculated, let’s make a bar plot with standard error bars to compare bill length across the three penguin species.
ggplot(penguins_length.se, aes(x = species, y = bill_length_mm, fill = species)) + #Plot the variables using the penguins_length.se data frame
geom_bar(stat="identity") + #Use the geom_bar() function to generate a bar plot
geom_errorbar(aes(ymin=bill_length_mm-se, ymax=bill_length_mm+se), #Use the geom_errorbar() function to plot the standard error bars
width=.2, # Define the width of the error bars
position=position_dodge(.9))
As a visual confirmation of our ANOVA results, we can see that there are indeed differences in bill length across the penguin species, most notably with the bill lengths of the Gentoo and Chinstrap penguins differing from the bill length of the Adelie penguins.
Next, we’ll visualize the bivariate linear regression that we ran above, where we examined whether bill length predicted flipper length in penguins.
We can use a scatterplot to visualize this relationship and add a linear regression line to the plot.
ggplot(penguins_length, aes(x=bill_length_mm, y=flipper_length_mm)) + #Plot the variables we are interested in
geom_point() + #geom_point() is the function used to create scatter plots
geom_smooth(method=lm, color="black") #Adds a linear regression line to your plot
Amazing! Now we have a visual representation of what our bivariate linear regression results look like. It’s clear that as bill length increases, so does flipper length.
Hopefully this part has been helpful for learning how to visualize data!
So far, we’ve covered quite a lot in separate sections! To end the R workshop, we wanted to put together a complete pipeline that you can use to analyze your own data in the future, starting with reading data into R, cleaning it up, running statistical analyses, and then visualizing the results.
Let’s say our objective was to examine whether bill length significantly differs from flipper length in penguins.
Here’s what a complete pipeline could look like to address this question using the tidyverse library in R!
#Step 1: Load the R packages you will be working with (this assumes that you've previously installed the R packages).
library(tidyverse)
library(report)
library(lme4)
library(ggplot2)
#Step 2: Read the data into R
penguins <- read.csv(file = "penguins.csv") #Load in the penguins CSV file and store it in a data frame called "penguins"
#Step 3: Subset the columns you are interested in using the select() function
penguins_length_pipeline <- penguins %>%
select(sample_number, species, island, bill_length_mm, flipper_length_mm)
#Step 4: Pivot data frame from wide to long
penguins_pipeline_long <- penguins_length_pipeline %>% pivot_longer(
cols=c("bill_length_mm", "flipper_length_mm"), #The names of the columns to pivot
names_to = "anatomical_region", #The name for the new character column
values_to = "length") #The name for the new values column
#Step 5: Clean your data and remove any rows with missing values
penguins_pipeline_long <- penguins_pipeline_long %>% drop_na()
#Step 6: Run a t-test to determine whether bill length significantly differs from flipper length
model2 <- t.test(x = penguins_pipeline_long$length[penguins_pipeline_long$anatomical_region == "bill_length_mm"],
y = penguins_pipeline_long$length[penguins_pipeline_long$anatomical_region == "flipper_length_mm"],
paired = FALSE,
alternative = "two.sided")
#Step 7: Interpret results of t-test output
report(model2)
#Step 8: Calculate the standard errors before plotting
#summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
penguins_pipeline_long.se <- summarySE(penguins_pipeline_long, measurevar="length", groupvars=c("anatomical_region"))
#Step 9: Create bar plot with standard error bars to visualize t-test
ggplot(penguins_pipeline_long.se, aes(x = anatomical_region, y = length, fill = anatomical_region)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=length-se, ymax=length+se),
width=.2,
position=position_dodge(.9))
#Step 10: All done!!!
Congrats! You made it through an R workshop! Hopefully you feel like you’re starting to pick up on some things! It’s also totally okay if things aren’t clicking just yet, but regardless of how you may be feeling, I will always be happy to work with you if you have any specific questions as you move forward on your coding journey. My email is stevent.martinez@temple.edu or you can always send me a Slack!